Question: What might affect the chance of getting heart disease?

About the data
  • We have the data from Western Collaborative Group Study contains 3154 healthy men, aged from 39 to 59, from the San Francisco area.
  • At the start of the study, all were free of heart disease.
  • Eight and a half years later, the study recorded whether these men now suffered from coronary heart disease (chd).
  • Other recorded variables that might be related to the chance of developing this disease are:
    • age - age in years
    • height - height in inches
    • weight - weight in pounds
    • sdp - systolic blood pressure in mm Hg
    • dbp - diastolic blood pressure in mm Hg
    • chol - fasting serum cholesterol in mm %
    • behave - behaviour type (A1, A2, B3, B4)
    • cigs - number of cigarettes smoked per day
    • dibep - behaviour type (A = Aggressive, P = Passive)
    • typechd - type of coronary heart disease (angina, infdeath, none, or silent)
    • timechd - time of coronary heart disease or end of follow-up
    • arcus - arcus senilis (absent or present)

Load the data

  • Is this data experimental or observational?

Explore the data (initial data analysis)

  • There are a number of ways to explore data.
  • First let’s start by looking at the distribution of the outcome (coronary heart disease). What do you notice?
  • Next let’s looks at the marginal distribution of numerical covariates by the outcome. Why is the median value for timechd clustered around 3000 for those that didn’t have a coronary heart disease?
Answer

The study went on for 8.5 years (which is about 8.5 \(\times\) 365.25 = 3105 days). The variable timechd for the group that didn’t have a coronary heart disease is related to number of days to end of follow-up. This means that those with earlier time (e.g. 1000 days) within this group means that they dropped out of study for some reason (e.g. due to death or unable to contact participant).

  • Let’s look also at the marginal distribution of categorical variables by the outcome. Do the levels within a factor have a higher association with chd? What do you notice in particular about typechd?
Answer typechd is derived from chd so those that did not have a coronary heart disease is all assigned “none” as expected.
  • ggpairs() from GGally package is useful for looking at the pairwise relationship between any two covariates in the data. It will be slow to compute if you have many variables (and individual graph may be too slow to see) so you may need to subset to a smaller number of variables first.
  • Let’s zoom in and have a look at the relationship between sdp and dbp. What do you notice about the relationship between these variables?
Answer The variables sdp and dbp are highly correlated. Systolic pressure (sdp) is the maximum blood pressure during contraction of the ventricles; while diastolic pressure (dbp) is the minimum pressure recorded just prior to the next contraction. Both measure blood pressure (in different ways) so the high correlation is perhaps expected!
  • What is the relationship between behave and dibep?
Answer All those that are assigned with value “A” for dibep are either “A1” or “A2” for behave. Similarly all that are assigned with value “B” for dipep are either “B1” or “B2” for behave. This suggests that perhaps behave was a further refinement of behaviour type based on dibep (so behave is nested within dibep).

Model the data

  • Why would you not (or would you) use typechd and timechd as predictors in the model?
Answer

The variables typechd and timechd are calculated based on the outcome! You can’t use predictors that were derived based on the outcome.

  • Do you think the behaviour type variable (behave) should be included in the model?
Answer behave does not appear to be (statistically) significantly contribute to explaining the chd so we omit from the model.
  • What do you think the best model that explains chd is?
Answer

Variable selection is a hard problem! We can use stepwise selection (which goes through backward selection - drop the least signiciant variable - and forward selection - add the most significant variable, until it meets a certain criteria), but an “automated” selection like this doesn’t account for the domain context.

For example, sdp is in the final model selected by stepwise selection but sdp is highly correlated with dbp. Should dbp been included instead of sdp? Also weight is included but not height. Tall people would naturally weight more than short people with the same body type. Would it have been better to feature engineer another variable that normalises weight with respect to height? Body mass index (which is weight in kg divided by square of body height in metres) is supposed to account for weight with respect to height.

Remember that all models are wrong, but some are useful. Your goal is just to find a useful model that approximates the reality well enough.